We analyze the occurrence frequencies of over 15 million words recorded in millions of books published 
during the past two centuries in seven different languages. For all languages and chronological subsets of the 
data we confirm that two scaling regimes characterize the word frequency distributions, with only the more 
common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric 
scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a 
decreasing marginal need for new words, a feature that is likely related to the underlying correlations 
between words. We calculate the annual growth fluctuations of word use, which have a decreasing trend as the 
corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This 
"cooling pattern" forms the basis of a third statistical regularity which, unlike the Zipf and the Heaps laws, is 
dynamical in nature. 



Books in libraries and attics around the world constitute an immense "crowd-sourced" historical record that 
traces the evolution of culture back beyond the limits of oral history. However, the disaggregation of written 
language into individual books makes the longitudinal analysis of language a difficult open problem. To this 
end, the book digitization project at Google Inc. presents a monumental step forward, providing an enormous, 
publicly accessible collection of written language in the form of the Google Books Ngram Viewer web application 1. 
Approximately 4% of all books ever published have been scanned, making available over $10^7$ occurrence time 
series (word-use trajectories) that archive cultural dynamics in seven different languages over a period of more 
than two centuries. This dataset highlights the utility of open "Big Data," which is the gateway to 
"metaknowledge" 2, the knowledge about knowledge. A digital data deluge is sustaining extensive interdisciplinary 
research efforts towards quantitative insights into the social and natural sciences 3-7. 

"Culturomics," the use of high-throughput data for the purpose of studying human culture, is a promising new 
empirical platform for gaining insight into subjects ranging from political history to epidemiology 8 . As first 
demonstrated by Michel et al. 8 , the Google n-gram dataset is well-suited for examining the microscopic properties 
of an entire language ecosystem. Using this dataset to analyze the growth patterns of individual word frequencies, 
Petersen et al. 9 recently identified tipping points in the life trajectory of new words, statistical patterns that govern 
the fluctuations in word use, and quantitative measures for cultural memory. The statistical properties of cultural 
memory, derived from the quantitative analysis of individual word-use trajectories, were also investigated by Gao 
et al. 10 , who found that words describing social phenomena tend to have different long-range correlations than 
words describing natural phenomena. 

Here we study the growth and evolution of written language by analyzing the macroscopic scaling patterns that 
characterize word use. Using the Google 1-gram data collected at the 1-year time resolution over the period 
1800-2008, we quantify the annual fluctuation scale of words within a given corpus and show that languages can be 
said to "cool by expansion." This effect constitutes a dynamic law, in contrast to the static laws of Zipf and Heaps, 
which are founded upon snapshots of single texts. The Zipf law 11-17, quantifying the distribution of word 
frequencies, and the Heaps law 13,18-20, relating the size of a corpus to the vocabulary size of that corpus, are classic 
paradigms that capture many complexities of language in remarkably simple statistical patterns. While these laws 
have been exhaustively tested on relatively small snapshots of empirical data, here we test the validity of these 
laws using extremely large corpora. 

Interestingly, we observe two scaling regimes in the probability 
density functions of word usage, with the Zipf law holding only for 
the set of more frequently used words, referred to as the "kernel 
lexicon" by Ferrer i Cancho et al. 14 . The word frequency distribution 
for the rarely used words constituting the "unlimited lexicon" 14 obeys 
a distinct scaling law, suggesting that rare words belong to a distinct 
class. This "unlimited lexicon" is populated by highly technical 
words, new words, numbers, spelling variants of kernel words, and 
optical character recognition (OCR) errors. 

Many new words start in relative obscurity, and their eventual 
importance can be under-appreciated by their initial frequency. 
This fact is closely related to the information cost of introducing 
new words and concepts. For single topical texts, Heaps observed 
that the vocabulary size exhibits sub-linear growth with document 
size 18 . Extending this concept to entire corpora, we find a scaling 
relation that indicates a decreasing "marginal need" for new words 
which are the manifestation of cultural evolution and the seeds for 
language growth. We introduce a pruning method to study the role of 
infrequent words on the allometric scaling properties of language. By 
studying progressively smaller sets of the kernel lexicon we can better 
understand the marginal utility of the core words. The pattern that 
arises for all languages analyzed provides insight into the intrinsic 
dependency structure between words. 

The correlations in word use can also be author and topic dependent. 
Bernhardsson et al. recently introduced the "metabook" 
concept 19,20 , according to which word-frequency structures are 
author-specific: the word-frequency characteristics of a random 
excerpt from a compilation of everything that a specific author could 
ever conceivably write (his/her "metabook") should accurately 
match those of the author's actual writings. It is not immediately 
obvious whether a compilation of all the metabooks of all authors 
would still conform to the Zipf law and the Heaps law. The immense 
size and time span of the Google n-gram dataset allow us to examine 
this question in detail. 

Results 

Longitudinal analysis of written language. Allometric scaling 
analysis 21 is used to quantify the role of system size on general 
phenomena characterizing a system, and has been applied to 
systems as diverse as the metabolic rate of mitochondria 22 and city 
growth 23-29. Indeed, city growth shares two common features with 
the growth of written text: (i) the Zipf law is able to describe the 
distribution of city sizes regardless of country or the time period of 
the data 26, and (ii) city growth has inherent constraints due to 
geography, changing labor markets and their effects on opportunities 
for innovation and wealth creation 27,28, just as vocabulary 
growth is constrained by human brain capacity and the varying 
utilities of new words across users 14. 

We construct a word counting framework by first defining the 
quantity $u_i(t)$ as the number of times word $i$ is used in year $t$. Since 
the number of books and the number of distinct words grow dramatically 
over time, we define the relative word use, $f_i(t)$, as 
the fraction of the total body of text occupied by word $i$ in the same 
year 

$$f_i(t) = u_i(t)/N_u(t), \qquad (1)$$

where the quantity $N_u(t) = \sum_{i=1}^{N_w(t)} u_i(t)$ is the total number of 
indistinct word uses while $N_w(t)$ is the total number of distinct words 
digitized from books printed in year $t$. Both $N_w$ ("types" giving 
the vocabulary size) and $N_u$ ("tokens" giving the size of the body 
of text) are generally increasing over time. 
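
As a minimal sketch of the counting framework in Eq. (1), the following Python snippet computes $u_i(t)$, $N_u(t)$, $N_w(t)$ and $f_i(t)$ for a single year. The function name `relative_word_use` and the toy token list are illustrative assumptions, not part of the original analysis, which works from pre-tabulated 1-gram counts.

```python
from collections import Counter

def relative_word_use(tokens):
    """Return (f, N_u, N_w) for one year's list of word tokens (hypothetical helper)."""
    counts = Counter(tokens)                 # u_i(t): number of uses of each word i
    n_u = sum(counts.values())               # N_u(t): total number of tokens
    n_w = len(counts)                        # N_w(t): number of distinct types
    freqs = {word: u / n_u for word, u in counts.items()}  # f_i(t), Eq. (1)
    return freqs, n_u, n_w

# Toy "corpus" standing in for all text printed in one year.
tokens_one_year = "the cat and the dog and the bird".split()
freqs, n_u, n_w = relative_word_use(tokens_one_year)
print(n_u, n_w, freqs["the"])
```

By construction the relative frequencies of all types sum to one, which is what makes $f_i(t)$ comparable across years with very different corpus sizes.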



The Zipf law and the two scaling regimes. Zipf investigated a 
number of bodies of literature and observed that the frequency of 
any given word is roughly inversely proportional to its rank 11, with 
the frequency of the $z$-ranked word given by the relation 

$$f(z) \sim z^{-\zeta}, \qquad (2)$$

with a scaling exponent $\zeta \approx 1$. This empirical law has been confirmed 
for a broad range of data, ranging from income rankings, city 
populations, and the varying sizes of avalanches, forest fires 30 and 
firm size 31 to the linguistic features of noncoding DNA 32. The Zipf 
law can be derived through the "principle of least effort," which 
minimizes the communication noise between speakers (writers) 
and listeners (readers) 16. The Zipf law has been found to hold for a 
large dataset of English text 14, but there are interesting deviations 
observed in the lexicon of individuals diagnosed with schizophrenia 15. 
Here, we also find statistical regularity in the distribution 
of relative word use for 11 different datasets, each comprising more 
than half a million distinct words taken from millions of books 8. 

Figure 1 shows the probability density functions $P(f)$ resulting 
from data aggregated over all the years (A,B) as well as over 1-year 
periods as demonstrated for the year $t = 2000$ (C,D). Regardless of 
the language and the considered time span, the probability density 
functions are characterized by a striking two-regime scaling, which 
was first noted by Ferrer i Cancho and Solé 14, and can be quantified as 

$$P(f) \sim \begin{cases} f^{-\alpha_-}, & \text{if } f < f_\times \quad \text{["unlimited lexicon"]} \\ f^{-\alpha_+}, & \text{if } f > f_\times \quad \text{["kernel lexicon"]} \end{cases} \qquad (3)$$

These two regimes, designated "kernel lexicon" and "unlimited lexicon," 
are thought to reflect the cognitive constraints of the brain's 
finite vocabulary 14. The specialized words found in the unlimited 
lexicon are not universally shared and are used significantly less 
frequently than the words in the kernel lexicon. This is reflected in 
the kink in the probability density functions and gives rise to the 
anomalous two-scaling distribution shown in Fig. 1. 

The exponent $\alpha_+$ and the corresponding rank-frequency scaling 
exponent $\zeta$ in Eq. (2) are related asymptotically by 14 

$$\alpha_+ \approx 1 + 1/\zeta, \qquad (4)$$

with no analogous relationship for the unlimited lexicon values $\alpha_-$ 
and $\zeta_-$. Table I lists the average $\alpha_+$ and $\alpha_-$ values calculated by 
aggregating $\alpha_\pm$ values for each year using a maximum likelihood 
estimator for the power-law distribution 33. We characterize the two 
scaling regimes using a crossover region around $f_\times \approx 10^{-5}$ to 
distinguish between $\alpha_-$ and $\alpha_+$: (i) $10^{-8} \leq f \leq 10^{-6}$ corresponds to $\alpha_-$ and 
(ii) $10^{-4} \leq f \leq 10^{-1}$ corresponds to $\alpha_+$. For the words that satisfy $f > f_\times$ 
that comprise the kernel lexicon, we verify the Zipf scaling law $\zeta \approx 1$ 
(corresponding to $\alpha_+ \approx 2$) for all corpora analyzed. For the unlimited 
lexicon regime $f < f_\times$, however, the Zipf law is not obeyed, as we 
find $\alpha_- \approx 1.7$. Note that $\alpha_-$ is significantly smaller in the Hebrew, 
Chinese, and the Russian corpora, which suggests that a more generalized 
version of the Zipf law 14 may be needed, one which is slightly 
language-dependent, especially when taking into account the usage 
of specialized words from the unlimited lexicon. 
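
To illustrate how a power-law exponent can be estimated by maximum likelihood and then converted to a Zipf exponent via Eq. (4), here is a hedged Python sketch. It assumes the standard continuous estimator with a lower cutoff only, $\hat{\alpha} = 1 + n / \sum_i \ln(x_i/x_{\min})$; the analysis described above fits within bounded frequency windows, so this is an approximation rather than the exact procedure. The names `powerlaw_mle` and `zipf_exponent`, and the synthetic sample, are illustrative assumptions.

```python
import math
import random

def powerlaw_mle(values, x_min):
    """Continuous power-law MLE for P(x) ~ x^(-alpha), x >= x_min (lower cutoff only)."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum

def zipf_exponent(alpha_plus):
    """Invert Eq. (4): zeta = 1 / (alpha_+ - 1)."""
    return 1.0 / (alpha_plus - 1.0)

# Synthetic data drawn by inverse-transform sampling from a pure power law
# with alpha = 2; this is not real word-frequency data.
rng = random.Random(0)
alpha_true, x_min = 2.0, 1e-4
sample = [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0))
          for _ in range(20000)]

alpha_hat = powerlaw_mle(sample, x_min)
zeta_hat = zipf_exponent(alpha_hat)
print(alpha_hat, zeta_hat)
```

With $\alpha_+ \approx 2$ the conversion gives $\zeta \approx 1$, which is the kernel-lexicon result quoted in the text.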

The Heaps law and the increasing marginal returns of new words. 

Heaps observed that vocabulary size, i.e. the number of distinct 
words, exhibits a sub-linear growth with document size 18. This 
observation has important implications for the "return on 
investment" of a new word as it is established and becomes 
disseminated throughout the literature of a given language. As a 
proxy for this return, Heaps studied how often new words are 
invoked in lieu of preexisting competitors and examined the 
linguistic value of new words and ideas by analyzing the relation 
between the total number of words printed in a body of text $N_u$, 
and the number of these which are distinct $N_w$, i.e. the vocabulary 
size 18. The marginal returns of new words, $\partial N_u/\partial N_w$, quantifies the 
impact of the addition of a single word to the vocabulary of a corpus 
on the aggregate output (corpus size). 

Figure 1 | Two-regime scaling distribution of word frequency. The kink in the probability density functions $P(f)$ occurs around $f_\times \approx 10^{-5}$ for each 
corpus analyzed (see legend). (A,B) Data from all years are aggregated into a single distribution. (C,D) $P(f)$ comprising data from only year $t = 2000$, 
providing evidence that the distribution is stable even over shorter time frames and likely emerges in corpora that are sufficiently large to be 
comprehensive of the language studied. For details concerning the scaling exponents we refer to Table I and the main text. 

For individual books, the empirically observed scaling relation 
between $N_u$ and $N_w$ obeys 

$$N_w \sim (N_u)^b, \qquad (5)$$

with $b < 1$, with Eq. (5) referred to as "the Heaps law". It has 
subsequently been found that Heaps' law emerges naturally in systems 
that can be described as sampling from an underlying Zipf 
distribution. In an information-theoretic formulation of the 
abstract concept of word cost, B. Mandelbrot predicted the relation 
$b = 1/\zeta$ in 1961 34, where $\zeta$ is the scaling exponent corresponding to 
$\alpha_+$, as in Eqs. (3) and (4). This prediction is limited to relatively small 
texts where the unlimited lexicon, which manifests in the $\alpha_-$ regime, 
does not play a significant role. A mathematical extension of this 
result for general underlying rank-distributions is also provided by 
Karlin 35 using an infinite urn scheme, and extended to broader 
classes of heavy-tailed distributions recently by Gnedin et al. 36. 
Recent research efforts using stochastic master equation techniques 
to model the growth of a book have also predicted this intrinsic 
relation between Zipf's law and Heaps' law 13,37,38. 

Table I | Summary of the scaling exponents characterizing the Zipf law and the Heaps law. To calculate $\sigma_r(t|f_c)$ (see Figs. 6 and 7) we use only 
the relatively common words that meet the criterion that their average word use $\langle f_i \rangle$ over the entire word history is larger than a threshold 
$f_c = 10/\mathrm{Min}[N_u(t)]$, listed in the first column for each corpus. The $b$ values shown are calculated using all words ($U_c = 0$). The "unlimited lexicon" 
scaling exponent $\alpha_-(t)$ is calculated for $10^{-8} < f < 10^{-6}$ and the "kernel lexicon" exponent $\alpha_+(t)$ is calculated for $10^{-4} < f < 10^{-1}$ using the 
maximum likelihood estimator method for each year. The average and standard deviation ($\langle \cdots \rangle \pm \sigma$) listed are computed using the $\alpha_+(t)$ 
and $\alpha_-(t)$ values over the 209-year period 1800-2008 (except for Chinese, which is calculated from 1950-2008 data). We show the Zipf 
scaling exponent calculated as $\zeta = 1/(\langle \alpha_+ \rangle - 1)$. The last column indicates the $\beta$ scaling exponents from Fig. 7(A). 

Figure 2 confirms a sub-linear scaling ($b < 1$) between $N_u$ and $N_w$ 
for each corpus analyzed. These results show how the marginal 
returns of new words are given by 

$$\frac{\partial N_u}{\partial N_w} \sim (N_w)^{(1-b)/b}, \qquad (6)$$

which is an increasing function of $N_w$ for $b < 1$. Thus, the relative 
increase in the induced volume of written languages is larger for new 
words than for old words. This is likely due to the fact that new words 
are typically technical in nature, requiring additional explanations 
that put the word into context with pre-existing words. Specifically, a 
new word requires the additional use of preexisting words as a result 
of both (i) the explanation of the content of the new word using 
existing technical terms, and (ii) the grammatical infrastructure 
necessary for that explanation. Hence, there are large spillovers in 
the size of the written corpus that follow from the intricate dependency 
structure of language stemming from the various grammatical 
roles 39,40. 
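
The Heaps exponent $b$ in Eq. (5) and the marginal-return exponent $(1-b)/b$ in Eq. (6) can be sketched numerically as follows. This is a minimal illustration, assuming an ordinary least squares fit of $\ln N_w$ against $\ln N_u$; the helper names and the synthetic yearly sizes (generated with $b = 0.5$ exactly) are assumptions made for the example, not data from the corpora.

```python
import math

def ols_slope(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

def heaps_exponent(corpus_sizes, vocab_sizes):
    """Estimate b in N_w ~ (N_u)^b from paired yearly (N_u, N_w) values."""
    slope, _ = ols_slope([math.log(u) for u in corpus_sizes],
                         [math.log(w) for w in vocab_sizes])
    return slope

# Synthetic yearly sizes obeying N_w = 30 * N_u^0.5 (illustrative only).
n_u = [10 ** k for k in range(5, 11)]
n_w = [30 * u ** 0.5 for u in n_u]

b = heaps_exponent(n_u, n_w)
marginal_exponent = (1 - b) / b   # exponent of N_w in Eq. (6)
print(b, marginal_exponent)
```

For $b = 1/2$ the marginal-return exponent equals one, so the induced corpus growth per new word scales linearly with the vocabulary size.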

In order to investigate the role of rare and new words, we calculate 
$N_u$ and $N_w$ using only words that have appeared at least $U_c$ times. We 
select the absolute number of uses as a word use threshold because a 
word in a given year cannot appear with a frequency less than $1/N_u$; 
hence any criterion using relative frequency would necessarily introduce 
a bias for small corpora samples. This choice also eliminates 
words that can spuriously arise from Optical Character Recognition 
(OCR) errors in the digitization process and also from intrinsic spelling 
errors and orthographic spelling variations. 
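
The pruning step described above can be sketched as a simple filter on the raw yearly counts. In this hedged example the function name `pruned_sizes`, the toy count dictionary, and the particular thresholds are illustrative assumptions; only the rule "keep words used at least $U_c$ times" is taken from the text.

```python
def pruned_sizes(counts, u_c):
    """Return (N_u, N_w) for one year, keeping only words with at least u_c uses."""
    kept = [u for u in counts.values() if u >= u_c]
    return sum(kept), len(kept)

# Toy yearly counts u_i(t); "teh" stands in for an OCR or spelling error.
counts_one_year = {"the": 1000, "language": 40, "zipf": 3, "teh": 1}

# Evaluate the pruned corpus and vocabulary sizes for a few thresholds.
table = {u_c: pruned_sizes(counts_one_year, u_c) for u_c in (1, 2, 4, 64, 2048)}
for u_c, (n_u, n_w) in table.items():
    print(u_c, n_u, n_w)
```

Because the threshold is an absolute count, the same rule applies uniformly to small and large yearly samples, whereas a relative-frequency cutoff would remove different absolute counts in different years.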

Figures 3 and 4 show the relational dependence of $N_u$ and $N_w$ on 
the exclusion of low-frequency words using a variable cutoff $U_c = 2^n$ 
with $n = 0 \ldots 11$. As $U_c$ increases the Heaps scaling exponent 
increases from $b \approx 0.5$, approaching $b \approx 1$, indicating that core 
words are structurally integrated into language as a proportional 
background. Interestingly, Altmann et al. 41 recently showed that 
"word niche" can be an essential factor in modeling word use 
dynamics. New niche words, though they are marginal increases to 
a language's lexicon, are themselves anything but "marginal" - they 
are core words within a subset of the language. This is particularly the 
case in online communities in which individuals strive to distinguish 
themselves on short timescales by developing stylistic jargon, 
highlighting how language patterns can be context dependent. 

We now return to the relation between Heaps' law and Zipf's law. 
Table I summarizes the $b$ values calculated by means of ordinary least 
squares regression using $U_c = 0$ to relate $N_u(t)$ to $N_w(t)$. For $U_c = 1$ 
we find that $b \approx 0.5$ for all languages analyzed, as expected from 
Heaps' law, but for $U_c > 8$ the $b$ value significantly deviates from 0.5, 
and for $U_c > 1000$ the $b$ value begins to saturate, approaching unity. 
Considering that $\alpha_+ \approx 2$ implies $\zeta \approx 1$ for all corpora, Figures 3 and 4 
show that we can confirm the relation $b(U_c) \approx 1/\zeta$ only for the more 
pruned corpora that require relatively large $U_c$. This hidden feature of 
the scaling relation highlights the underlying structure of language, 
which forms a dependency network between the common words of 
the kernel lexicon and their more esoteric counterparts in the unlimited 
lexicon. Moreover, the function $\partial N_w/\partial N_u \sim (N_u)^{b-1}$ is a 
monotonically decreasing function for $b < 1$, demonstrating the decreasing 
marginal need for additional words as a corpus grows. In other 
words, since we get more and more "mileage" out of new words in 
an already large language, additional words are needed less and less. 

Corpora size and word-use fluctuations. Lastly, it is instructive to 
examine how vocabulary size $N_w$ and the overall size of the corpora 
$N_u$ affect fluctuations in word use. Figure 5 shows how $N_w(t)$ and 
$N_u(t)$ vary over time over the past two centuries. Note that, apart 
from the periods during the two World Wars, the number of words 
printed, which we will refer to as the "literary productivity", has been 
increasing over time. The number of distinct words (vocabulary size) 
has also increased, reflecting basic social and technological 
advancement 8. 

To investigate the role of fluctuations, we focus on the logarithmic 
growth rate, commonly used in finance and economics, 

$$r_i(t) \equiv \ln f_i(t + \Delta t) - \ln f_i(t) = \ln\left[\frac{f_i(t + \Delta t)}{f_i(t)}\right], \qquad (7)$$

to measure the relative growth of word use over 1-year periods, $\Delta t = 1$ year. 
Recent quantitative analysis on the distribution $P(r)$ of word 
use growth rates $r_i(t)$ indicates that annual fluctuations in word use 
deviate significantly from the predictions of null models for 
language evolution 9. 
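
A minimal sketch of Eq. (7) for a single word's frequency series is given below. The function name `log_growth_rates`, the toy series, and the choice to return `None` for years in which the word has zero recorded use (where the logarithm is undefined) are assumptions made for illustration.

```python
import math

def log_growth_rates(freqs, dt=1):
    """r_i(t) = ln f_i(t + dt) - ln f_i(t) for consecutive entries of a yearly series."""
    rates = []
    for t in range(len(freqs) - dt):
        f_now, f_next = freqs[t], freqs[t + dt]
        if f_now > 0 and f_next > 0:
            rates.append(math.log(f_next / f_now))
        else:
            rates.append(None)   # growth rate undefined when a frequency is zero
    return rates

# Toy relative-frequency trajectory f_i(t) over four consecutive years.
f_word = [1e-6, 2e-6, 2e-6, 1e-6]
r_word = log_growth_rates(f_word)
print(r_word)
```

Logarithmic growth rates are additive over time, so the defined rates in a series sum to the logarithm of the ratio between the final and initial frequencies.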



[Figure 2: log-log scatter plots, one panel per language (Chinese, German, English, French, Hebrew, Spanish); axis label: Corpus size, $N_u(t)$.]

Figure 2 | Allometric scaling of language. Scatter plots of the output corpora size $N_u$ given the empirical vocabulary size $N_w$ using all data ($U_c = 0$). 



We define an aggregate fluctuation scale, $\sigma_r(t|f_c)$, using a frequency 
cutoff $f_c \propto 1/\mathrm{Min}[N_u(t)]$ to eliminate infrequently used words. The 
quantity $\mathrm{Min}[N_u(t)]$ is the minimum corpus size over the period of 
analysis, and so $1/\mathrm{Min}[N_u(t)]$ is an upper bound for the minimum 
observed frequency for words in the corpora. Figure 6 shows $\sigma_r(t|f_c)$, 
the standard deviation of $r_i(t)$ calculated across all words that satisfy 
the condition $\langle f_i \rangle \geq f_c$ for words with lifetime $T_i \geq 10$ years, using 
$f_c = 1/\mathrm{Min}[N_u(t)]$. Visual inspection suggests a general decrease in 
$\sigma_r(t|f_c)$ over time, marked by sudden increases during times of political 
conflict. Hence, the persistent increase in the volume of written 
language is correlated with a persistent downward trend in what could 
be thought of as the "system temperature" $\sigma_r(t|f_c)$: as a language 
grows and matures it also "cools off." 

Figure 4 | Pruning reveals the variable marginal return of words. The Heaps scaling exponent $b$ depends on the extent of the inclusion of the rarest 
words. For a given corpus and $U_c$ value we make a scatter plot between $N_w(t|U_c)$ and $N_u(t|U_c)$ using words with $u_i(t) > U_c$, using the same data 
color-$U_c$ correspondence as in Fig. 3. (Panel Inset) We use OLS estimation to estimate the scaling exponent $b(U_c)$ for the model 
$N_w(t|U_c) \sim [N_u(t|U_c)]^b$ to show that $b(U_c)$ increases from approximately 0.5 towards unity as we prune the corpora of extremely rare words. 
Our longitudinal language analysis provides insight into the structural importance of the most frequent words, which are used more times per 
appearance and which play a crucial role in the usage of new and rare words. 

Since this cooling pattern could arise as a simple artifact of an 
independent identically distributed (i.i.d.) sampling from an increasingly 
large dataset, we test the scaling of $\sigma_r(t|f_c)$ with corpus size. 
Figure 7(A) shows that for large $N_u(t)$, each language is characterized 
by a scaling relation 

$$\sigma_r(t|f_c) \sim N_u(t|f_c)^{-\beta}, \qquad (8)$$

with language-dependent scaling exponent $\beta \approx 0.08$-$0.35$. We use 
$f_c = 10/\mathrm{Min}[N_u(t)]$, which defines the frequency threshold for the 
inclusion of a given word in our analysis. There are two candidate 
null models which give insight into the limiting behavior of $\beta$. The 
Gibrat proportional growth model predicts $\beta = 0$ and the Yule-Simon 
urn model predicts $\beta = 1/2$ 42. We observe $\beta < 1/2$, which 
indicates that the fluctuation scale decreases more slowly with 
increasing corpus size than would be expected from the Yule-Simon 
urn model prediction, deducible via the "delta method" for 
determining the approximate scaling of a distribution and its standard 
deviation $\sigma$ 43. 
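
The size-variance scaling test can be sketched as a log-log regression of the yearly fluctuation scale on the yearly corpus size, with the scaling exponent taken as minus the fitted slope. In this hedged example the function names, the use of the population standard deviation, and the synthetic $(N_u, \sigma_r)$ pairs (generated with an exponent of 0.25) are illustrative assumptions rather than the original procedure or data.

```python
import math
import statistics

def fluctuation_scale(growth_rates):
    """sigma_r(t): standard deviation of r_i(t) across the words retained in one year."""
    return statistics.pstdev(growth_rates)

def fit_beta(corpus_sizes, sigmas):
    """OLS slope of ln(sigma_r) on ln(N_u); the scaling exponent is minus that slope."""
    xs = [math.log(n) for n in corpus_sizes]
    ys = [math.log(s) for s in sigmas]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return -s_xy / s_xx

# Fluctuation scale for one toy year of growth rates.
sigma_example = fluctuation_scale([0.1, -0.1, 0.3, -0.3])

# Synthetic yearly pairs obeying sigma_r = 2 * N_u^(-0.25) (illustrative only).
sizes = [10 ** k for k in range(6, 11)]
sigmas = [2.0 * n ** -0.25 for n in sizes]
beta = fit_beta(sizes, sigmas)
print(sigma_example, beta)
```

A fitted value between the two null-model limits, 0 for Gibrat proportional growth and 1/2 for the Yule-Simon urn model, corresponds to the intermediate regime reported in the text.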

To further compare the roles of the kernel lexicon versus the 
unlimited lexicon, we apply our pruning method to quantify the 
dependence of the scaling exponent $\beta$ on the fluctuations arising 
from rare words. We omit words from our calculation of $\sigma_r(t|U_c)$ 
if their use $u_i(t)$ in year $t$ falls below the word-use threshold $U_c$. 
Fig. 7(B) shows that $\beta(U_c)$ increases from values close to 0 to 
values less than 1/2 as $U_c$ increases exponentially. An increasing 
$\beta(U_c)$ confirms our conjecture that rare words are largely responsible 
for the fluctuations in a language. However, because of the 
dependency structure between words, there are residual fluctuation 
spillovers into the kernel lexicon, likely accounting for 
the fact that $\beta < 1/2$ even when the fluctuations from the unlimited 
lexicon are removed. 
Figure 5 | Literary productivity and vocabulary size in the Google Inc. 1-gram dataset over the past two centuries. (A) Total size of the different corpora 
$N_u(t|U_c)$ over time, calculated by using words that satisfy $u_i(t) \geq U_c = 16$ to eliminate extremely rare 1-grams. (B) Size of the written vocabulary $N_w(t|U_c)$ 
over time, calculated under the same conditions as (A). 



A size-variance relation showing that larger entities have smaller 
characteristic fluctuations was also demonstrated at the scale of individual 
words using the same Google n-gram dataset 9. Moreover, this 
size-variance relation is strikingly analogous to the decreasing 
growth rate volatility observed as complex economic entities (i.e. 
firms or countries) increase in size 42,44-48, which strengthens the analogy 
of language as a complex ecosystem of words governed by competitive 
forces. 



A further possible explanation for $\beta < 1/2$ is that language growth 
is counteracted by the influx of new words, which tend to have 
growth spurts around 30-50 years following their birth in the written 
corpora 9. Moreover, the fluctuation scale $\sigma_r(t|f_c)$ is positively 
influenced by adverse conditions such as wars and revolutions, since 
a decrease in $N_u(t)$ may decrease the competitive advantage that old 
words have over new words, allowing new words to break through. 
The globalization effect, manifesting from increased human mobility 
during periods of conflict, is also responsible for the emergence of 
new words within a language. 

Figure 6 | Non-stationarity in the characteristic growth fluctuation of word use. The standard deviation $\sigma_r(t|f_c)$ of the logarithmic growth rate $r_i(t)$ is 
presented for all examined corpora. There is an overall decreasing trend arising from the increasing size of the corpora, as depicted in Fig. 5(A). On the 
other hand, the steady production of new words, as depicted in Fig. 5(B), counteracts this effect. We calculate $\sigma_r(t|f_c)$ using the relatively common words 
that meet the criterion that their average word use $\langle f_i \rangle$ over the entire word history $T_i$ (using words with lifetime $T_i \geq 10$ years) is larger than a 
threshold $f_c = 1/\mathrm{Min}[N_u(t)]$ (see Table I). 

Figure 7 | Growth fluctuations of word use scale with the size of the corpora. (A) Depicted is the quantitative relation in Eq. (8) between $\sigma_r(t|f_c)$ and the 
corpus size $N_u(t|f_c)$. We calculate $\sigma_r(t|f_c)$ using the relatively common words that meet the criterion that their average word use $\langle f_i \rangle$ over the entire word 
history (using words with lifetime $T_i \geq 10$ years) is larger than a threshold $f_c = 10/\mathrm{Min}[N_u(t)]$ (see Table I). We show the language-dependent scaling 
value $\beta \approx 0.08$-$0.35$ in each panel. For each language we show the ordinary least squares best-fit $\beta$ value with the standard error in parentheses. 
(B) Summary of $\beta(U_c)$ exponents calculated using a use-threshold $U_c$ instead of a frequency threshold $f_c$ as used in (A). Error bars indicate the standard 
error in the OLS regression. We perform this additional analysis in order to provide alternative insight into the role of extremely rare words. For increasing 
$U_c$ the $\beta(U_c)$ value for each corpus increases from $\beta \approx 0.05$ to $\beta < 0.25$. This language pruning method quantifies the role of new rare words (also 
including OCR errors, spelling and other orthographic variants), which are the significant components of language volatility. 

Discussion 

A revolutionary description of language and culture requires many 
factors and much consideration 49,50. While scientific and technological 
advances are largely responsible for written language growth as 
well as the birth of many new words 9, socio-political factors also play 
a strong role. For example, the sexual revolution of the 1960s triggered 
the sudden emergence of the words "girlfriend" and "boyfriend" 
in the English corpora 1, illustrating the evolving culture of 
romantic courting. Such technological and socio-political perturbations 
require case-by-case analysis for any deeper understanding, as 
demonstrated comprehensively by Michel et al. 8. 

Here we analyzed the macroscopic properties of written language 
using the Google Books database 1. We find that the word 
frequency distribution $P(f)$ is characterized by two scaling regimes. 
While frequently used words that constitute the kernel lexicon 
follow the Zipf law, the distribution has a less-steep scaling regime 
quantifying the rarer words constituting the unlimited lexicon. 
Our result is robust across languages as well as across other data 
subsets, thus extending the validity of the seminal observation by 
Ferrer i Cancho and Solé 14, who first reported it for a large body of 
English text. The kink in the slope preceding the entry into the 
unlimited lexicon is a likely consequence of the limits of human 
mental ability that force the individual to optimize the usage of 
frequently used words and forget specialized words that are seldom 
used. This hypothesis agrees with the "principle of least 
effort" that minimizes communication noise between speakers 
(writers) and listeners (readers), which in turn may lead to the 
emergence of the Zipf law 16. 

Using an extremely large written corpus that documents the 
profound expansion of language over centuries, we analyzed the 
dependence of vocabulary growth on corpus growth and validated 
the Heaps law scaling relation given by Eq. (5). Furthermore, we 
systematically pruned the corpus data using a word occurrence threshold 
$U_c$ and compared the resulting $b(U_c)$ value to the $\zeta \approx 1$ value, 
which is stable since it is derived from the "kernel" lexicon. We 
conditionally confirm the theoretical prediction $\zeta \approx 1/b$ 13,34-38, which 
we validate only in the case that the extremely rare "unlimited" 
lexicon words are not included in the data sample (see Figs. 3 and 4). 

The economies of scale ($b < 1$) indicate that there is an increasing 
marginal return for new words, or alternatively, a decreasing marginal 
need for new words, as evidenced by allometric scaling. This can 
intuitively be understood in terms of the increasing complexities and 
combinations of words that become available as more words are 
added to a language, lessening the need for lexical expansion. 
However, a relationship between new words and existing words is 
retained. Every introduction of a word, from an informal setting (e.g. 
an expository text) to a formal setting (e.g. a dictionary), is yet another 
chance for the more common describing words to play out their 
respective frequencies, underscoring the hierarchy of words. This 
can be demonstrated quite instructively from Eq. (6), which implies 
for $b = 1/2$ that $\partial N_u/\partial N_w \propto N_w$, meaning that it requires a quantity 
proportional to the vocabulary size $N_w$ to introduce a new word, or 
alternatively, that a quantity proportional to $N_w$ necessarily results 
from the addition. 

Though new words are needed less and less, the expansion of 
language continues, doing so with marked characteristics. Taking 
the growth rate fluctuations of word use to be a kind of temperature, 
we note that like an ideal gas, most languages "cool" when they 
expand. Given that the relationship between the temperature 
and corpus volume is a power law, one may, loosely speaking, liken 
language growth to the expansion of a gas or the growth of a 
company 42,44-48. In contrast to the static laws of Zipf and Heaps, we note 
that this finding is of a dynamical nature. 

Other aspects of language growth may also be understood in terms 
of expansion of a gas. Since larger literary productivity imposes a 
downward trend on growth rate fluctuations (which also implies 
that the ranking of the top words and phrases becomes more stable 51), 
productivity itself can be thought of as a kind of inverse pressure in 
that highly productive years are observed to "cool" a language off. 
Also, it is during the "high-pressure" low productivity years that new 
words tend to emerge more frequently. 

Interestingly, the appearance of new words is more like gas 
condensation, tending to cancel the cooling brought on by language 
expansion. These two effects, corpus expansion and new word 
"condensation," therefore act against each other. Across all corpora we 
calculate a size-variance scaling exponent $0 < \beta < 1/2$, bounded by 
the prediction of $\beta = 0$ (Gibrat growth model) and $\beta = 1/2$ 
(Yule-Simon growth model) 42. 

In the context of allometric relations, Bettencourt et al. 27 note that 
the scaling relations describing the dynamics of cities show an 
increase in the characteristic pace of life as the system size grows, 
whereas those found in biological systems show a decrease in 
characteristic rates as the system size grows. Since the languages we 
analyzed tend to "cool" as they expand, there may be deep-rooted 
parallels with biological systems based on principles of efficiency 16. 
Languages, like biological systems, demonstrate economies of scale 
($b < 1$) manifesting from a complex dependency structure that mimics 
a hierarchical "circulatory system" required by the organization of 
language 39,52-56 and the limits of the efficiency of the speakers/writers 
who exchange the words 19,41-57. 